Dataset: a tidy data set from 2009 testing the chemical properties of Vinho Verde red wine. At least 3 wine experts rated the quality of each wine between 0 (very bad) and 10 (very excellent). More information can be found here..
Question: What variables contribute to making the best quality Red Wine?
Dimensions?
## [1] 1599 13
Variables?
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
Field Descriptions:
Structure?
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
Descriptive Stats?
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
Thoughts at this stage - Quality range is between 3 and 8 - does this correlate with anything?
Mean/Median seem to be relatively close on all variables except total.sulfur.dioxide and chlorides - long tailed?
Next step to look at the distribution of each individual variable and how a log10 or sqrt transformation might affect this.
Transformations have an effect but once 8 outliers are removed (> 14) the distribution is quite normal. I also checked the outliers in case there were any significant results. As the quality is all between 5 and 7 where the majority of the results fall this is fine.
Quality Table: acidity > 14
##
## FALSE TRUE
## 3 10 0
## 4 53 0
## 5 677 4
## 6 637 1
## 7 196 3
## 8 18 0
As with fixed acidity. The best option seemed to be to remove 4 outliers (>1.2)
Quality Table: volatile acidity > 1.2
##
## FALSE TRUE
## 3 9 1
## 4 53 0
## 5 678 3
## 6 638 0
## 7 199 0
## 8 18 0
There are 132 0 values which seem to be pretty evenly distributed on quality so neither of the transformations are suitable. The best I can do here is remove 6 outliers which spreads the distribution out a bit.
Quality Table: citric acid == 0
##
## FALSE TRUE
## 3 7 3
## 4 43 10
## 5 624 57
## 6 584 54
## 7 191 8
## 8 18 0
Quality Table: citric acid > 0.75
##
## FALSE TRUE
## 3 10 0
## 4 52 1
## 5 679 2
## 6 637 1
## 7 197 2
## 8 18 0
Quality Table: residual sugar > 6
##
## FALSE TRUE
## 3 10 0
## 4 51 2
## 5 658 23
## 6 624 14
## 7 191 8
## 8 17 1
Outliers over 6 looks evenly spread.
Quality Table: chlorides > 0.25
##
## FALSE TRUE
## 3 9 1
## 4 52 1
## 5 665 16
## 6 632 6
## 7 198 1
## 8 18 0
Quality Table: free sulfur dioxide > 60
##
## FALSE TRUE
## 3 10 0
## 4 53 0
## 5 678 3
## 6 637 1
## 7 199 0
## 8 18 0
For total SO2 I chose to remove 3 outliers (>160). The distribution is still heavily positively skewed but neither transformation option seems to have a significant effect. The 3 outliers were quite high in quality so I double checked the mean for all data > 150 in case this was a significant trend.
Quality Table: total sulfur dioxide > 160
##
## FALSE TRUE
## 3 10 0
## 4 53 0
## 5 681 0
## 6 637 1
## 7 197 2
## 8 18 0
Double checking means (new then old).
## [1] 5.666667
## [1] 5.636023
Density: No change - normal already
pH: no change - normal already
Quality Table: sulphates > 1.5
##
## FALSE TRUE
## 3 10 0
## 4 52 1
## 5 677 4
## 6 635 3
## 7 199 0
## 8 18 0
No change - transformations have no significant impact, neither does removing the outliers.
Quality Table: alcohol > 14
##
## FALSE TRUE
## 3 10 0
## 4 53 0
## 5 680 1
## 6 638 0
## 7 199 0
## 8 18 0
The combined outlier removal resulted in 20 less rows:
## [1] 1579 13
Below is the ggpairs for the transformed data. I’ve colour coded the Pearson’s R results as follows:
yellow (0.3-0.5 small meaning) orange (0.5-0.7 moderate) red (0.7+ pretty large)
The corresponding plots are highighted in green.
What does this mean though?
I ended up plotting it on a flow diagram using draw.io. Colour coding applies to the arrows.
Conclusions:
Although Free and Total SO2 are highly correlated, they do not correlate with any other variables so I am not going to analyse them further.
Residual sugar seems to have a small correlation to density but nothing else. This may be due to the fact that vinho verde red wine isn’t intended to be sweet. I won’t analyse that further either.
Start off with 3 critical factors - Alcohol, Volatile Acid and Sulphates.
Link these critical factors to related variables and see if I can see any further patterns.
There majority of the data has quality level 5,6,7.
Some stats first:
## Source: local data frame [6 x 4]
##
## quality alcohol_mean alcohol_median n
## (int) (dbl) (dbl) (int)
## 1 3 9.850000 9.90 9
## 2 4 10.281731 10.00 52
## 3 5 9.880655 9.70 672
## 4 6 10.631234 10.50 635
## 5 7 11.474180 11.50 193
## 6 8 12.094444 12.15 18
Percent with quality 5,6,7 = 0.9493433
Initial Analysis:
95% of the data is in the quality bracket 5-7 The alcohol content goes up in line with the quality. Nothing <10 for top quality, nothing over 12 for lower The majority of mid range wines (5 & 6) = 10
In more detail..
Once the quality gets to 5 there is a clear positive trend.
What about density?, fixed acidity, chlorides, citric acid and pH?
How do they affect the alcohol level.
What makes it high?
I started with density. The density of the wine is known to be linked to the alcohol content so this correlation is not a surprise.
There seems to be a negative linear correlation between alcohol and density - the content is higher when the density is lower.
How about fixed acidity and density?
There seems to be a positive linear correlation here. The higher the density, the higher the fixed acidity.
How about chlorides and density?
Slight positive trend here but high distribution of quality. It seems more of a cluster that a trend and the quality is dispersed - dead end?
How about citric acid and density?
The smoother is confusing here. There is no clear linear regression. However, the correlation between citric acid and quality looks quite strong.
Citric acid is “found in small quantities and can add freshness and flavour to wines”. This could clearly have an impact on quality despite the fact that the quality ratings for 0 values are well distributed.
To double check, lets see what happens when we remove the 0 values.
Pearsons’R on quality is
## [1] 0.2349862
This is not wildly significant. Adding this to the fact that the data we removed did not show a trend of low quality data I’d say this is a dead end as well.
And lastly pH?
Looks like a strong negative trend here…
But does pH figure? Tricky to tell. The Ph levels do seem higher at the ‘positive’ end?
Possible hypothesis:
low density levels + low fixed.acidity + higher ph levels = higher alcohol content = better quality wine?
Some stats first:
## Source: local data frame [6 x 4]
##
## quality vol.acid_mean vol.acid_median n
## (int) (dbl) (dbl) (int)
## 1 3 0.8072222 0.8150 9
## 2 4 0.6973077 0.6725 52
## 3 5 0.5741518 0.5800 672
## 4 6 0.4985276 0.5000 635
## 5 7 0.4057254 0.3700 193
## 6 8 0.4233333 0.3700 18
It looks like a gradual descent in acidity from low to high quality. The mean and median quite similar.
Not much more to glean from the histograms really. Distributions seem pretty similar on first glance.
This plot does reflect the downward trend but there is a lot of noise. The smoother may be misleading?
Let’s have a look at mean and quantiles. Also let’s check if it’s worth adding log10 or sqrt transformations.
No to transformation. Slight downward trend still visible.
Where’s the correlation between Volatile and Fixed Acidity?
Seems to be a bit of a cluster - the majority of results have low fixed and volatile acidity. It looks like fixed acidity and volatile acidity are related and pretty constant - as you’d expect.
There seems to be no particular correlation with quality so perhaps a dead end?
Possible hypothesis - Volatile Acidity may have a small impact on quality - the lower the acidity, the better the wine. This may or may not be significant.
Some basic analysis first:
## Source: local data frame [6 x 4]
##
## quality sulph_mean sulph_median n
## (int) (dbl) (dbl) (int)
## 1 3 0.5888889 0.55 9
## 2 4 0.5694231 0.56 52
## 3 5 0.6194792 0.58 672
## 4 6 0.6755906 0.64 635
## 5 7 0.7436788 0.74 193
## 6 8 0.7677778 0.74 18
Looks like a pretty clear positive trend in both mean and median.
Lets look at a facet grid - I’ve transformed sulphates by log10 as that was the decision I made earlier.
Histograms back this up..
What about mean and quantiles?
Slight positive trend still visible..
What about citric acid?
The distribution is pretty constant. There seem to be more low quality wines for lower sulphate levels (as we discovered earlier) but the addition of citric acid levels does not impact that in any way.
or chlorides?
This backs up the theory that higher sulphate levels may = better quality wine.
There is a clear constant and logical cluster between chlorides and sulphates so chlorides are unlikely to have a follow through impact on quality.
Possible hypothesis - Sulphates may have a small impact on quality - the higher the sulphate level, the better the wine.
Three possible hypotheses so far:
Plot One: Alcohol, Quality and related values - following the trail
Explanation:
Possible Conclusion:
low density + low fixed acidity + high ph = high alcohol = better wine?
Here’s the plot to back me up…
A bit more statistical analysis:
##
## Calls:
## m4: lm(formula = I(alcohol) ~ I(quality) + density + fixed.acidity +
## pH, data = rwine)
##
## ===============================
## (Intercept) 440.405***
## (13.269)
## I(quality) 0.371***
## (0.023)
## density -448.051***
## (13.486)
## fixed.acidity 0.452***
## (0.019)
## pH 3.247***
## (0.159)
## -------------------------------
## R-squared 0.6
## adj. R-squared 0.6
## sigma 0.7
## F 564.3
## p 0.0
## Log-likelihood -1623.6
## Deviance 722.9
## AIC 3259.3
## BIC 3291.5
## N 1579
## ===============================
Conclusion:
The graphs indicate a clear link between alcohol content, density and wine quality. Factoring in a lower fixed acidity level and a higher pH level makes the regression stronger.
The stats are pretty good too. R-squared implies that 60% of the variance is explained by our variables. The F-score, Deviance, AIC and BIC are all very high though. Is this the result of linking 5 such diverse variables?
Plot Two: Sulphate and Quality
Explanation:
There is a visible positive trend between quality and sulphates.
Possible Conclusion:
higher sulphate levels = better quality wine
## Scale for 'y' is already present. Adding another scale for 'y', which
## will replace the existing scale.
A bit more statistical analysis:
##
## Calls:
## m1: lm(formula = I(log10(sulphates)) ~ I(quality), data = rwine)
##
## =============================
## (Intercept) -0.411***
## (0.016)
## I(quality) 0.038***
## (0.003)
## -----------------------------
## R-squared 0.1
## adj. R-squared 0.1
## sigma 0.1
## F 182.1
## p 0.0
## Log-likelihood 1546.1
## Deviance 13.0
## AIC -3086.3
## BIC -3070.2
## N 1579
## =============================
Conclusion: The plot shows a clear positive correlation between sulphates and wine quality but the stats tell a different story.
R-squared only explains 10% of the variance which does not indicate high significance. The F-score is high, possibly too high, as are the AIC, BIC and Deviance. All in all I’d conclude that the results are unlikely to be significant but I’d keep observing sulphate leves when a new dataset comes in.
Plot Three: Volatile Acidity and Quality
Explanation:
There is a visible negative trend between quality and volatile acidity.
PossibleConclusion:
lower volatile acidity levels = better quality wine
A bit more statistical analysis:
##
## Calls:
## m1: lm(formula = I(volatile.acidity) ~ I(quality), data = rwine)
##
## =============================
## (Intercept) 0.993***
## (0.029)
## I(quality) -0.083***
## (0.005)
## -----------------------------
## R-squared 0.1
## adj. R-squared 0.1
## sigma 0.2
## F 269.9
## p 0.0
## Log-likelihood 647.2
## Deviance 40.7
## AIC -1288.4
## BIC -1272.3
## N 1579
## =============================
Conclusion: The plots shows a visible link between Volatile Acidity and Quality but as with Sulphates the statistics don’t back this up very well.
R-squared only explains 10% of the variance which does not indicate high significance. The F-score is quite high as are the AIC, BIC and Deviance. Again, I’d conclude that the results are unlikely to be significant but I’d keep observing volatile acidity levels when a new dataset comes in.
The results indicate that there is a strong likelihood that the following factors contribute to high quality wine:
They also indicate a possible linke between these variables, but for this dataset it does not look significant.
The ideal follow-up would be a matching analysis on subsequent years. The dataset was recorded in 2009. If similar data has been gathered over the past six years that would allow us to come to more reliable, robust conclusions.
It is also worth bearing in mind that the ‘quality’ is a sensory score given by 3 professionals. If the same professionals were involved each year that would make the analysis more robust.
This project has been a learning curve in action. It has been a while since I took the Data Analysis with R module so I’ve had to refamilarise myself with RStudio, ggplot and a lot more. After the 1st submission I had to rewrite my R file in RMD as I forgot about this and was working on an RScript!
It’s been interesting as back in January I had a go analysing the white wine data using Excel and Python. This was arduous and inconclusive so it’s been great to really dig into the red wine data and come to some logical conclusions.
I started with a GGPair and quickly realised that I would have to transform the data first as the distribution was only normal on a couple of variables. My first mistake was to try and hurry the process and plot standard, log10 and sqrt histograms for all the univariates on one chart. While this saved spaced, I missed various nuances.
After the 1st submission I looked again in more detail and ended up removing outliers for three of the variables instead. This was time consuming but very valuable.
The other thing I had difficulty with was interpreting the GGPair and therefore deciding where to go next. I found the flow diagram really useful both for deciding the critical factors and the variables that might impact them.
For the critical factors, it was the challenge of getting to know ggplot again and working out the best geom and scale to convey meaning. I found the percentiles and mean very useful. The quality scale took a while but I got there in the end.
I also failed to look at statistics first time round. The lm stats were really helpful either in supporting or making me sceptical about the plot indications.
Overall, this has been a positive experience though as I now feel confident enough to approach a new R analysis with new data.
A possible next step could be to apply machine learning: Can I predict top quality wines based on the links I’ve discovered? What does KMeansBest say about my features? Perhaps I could create some new composite features? PCA of alcohol/density springs to mind?
Or I could repeat the process with white wine data and look for similarities and differences?